Closed Bug 1598342 Opened 6 years ago Closed 5 years ago

TSan: data race nsprpub/pr/src/malloc/prmem.c:476:9 in PR_Free

Categories

(NSPR :: NSPR, defect, P5)

x86_64
Linux
defect

Tracking

(firefox75 fixed)

RESOLVED FIXED
Tracking Status
firefox75 --- fixed

People

(Reporter: decoder, Assigned: decoder)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

I'm seeing the attached race all over the place when trying to run Mochitests with TSan; it is by far the most frequent failure. I'm not sure whether this can cause any actual problems, such as a hang during thread shutdown.

If I understand correctly (someone should double check this):

  1. There's a thread T54 that is finished with its work and is cleaning up here
  2. The main thread is waiting in pthread_join for T54 to exit
  3. T54 exits and the main thread proceeds
  4. Main thread free()s the control data for T54

And TSan is complaining about a race between steps 1 and 4? Maybe it doesn't know about the join relationship in between? Is there a way to annotate this?

(This assumes nothing super weird is going on, like a mixup of thread IDs that makes us free() the wrong thing.)
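To make the sequence above concrete, here is a minimal, self-contained sketch (hypothetical code, not taken from NSPR) of the pattern in steps 1-4. TSan intercepts pthread_create/pthread_join, so the join should establish a happens-before edge between the worker's last access and the later free(), and this pattern should normally not be reported:

  /* Minimal sketch of steps 1-4 above -- hypothetical, not NSPR code.
     Build with: clang -fsanitize=thread -g example.c -lpthread */
  #include <pthread.h>
  #include <stdlib.h>

  struct control { int state; };

  static void *worker(void *arg) {
      struct control *c = arg;
      c->state = 1;               /* step 1: the thread touches its control data */
      return NULL;                /* step 3: the thread exits */
  }

  int main(void) {
      struct control *c = malloc(sizeof(*c));
      pthread_t t;
      pthread_create(&t, NULL, worker, c);
      pthread_join(t, NULL);      /* step 2: main waits for the thread to exit */
      free(c);                    /* step 4: main frees the control data -- no race,
                                     because the join orders it after step 1 */
      return 0;
  }

If TSan were to flag this pattern anyway, something in its view of the create/join relationship would have to be off.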

Blocks: tsan

(In reply to :dmajor from comment #2)

Maybe it doesn't know about the join relationship in between? Is there a way to annotate this?

TSan knows the pthread primitives; it should not be necessary to annotate anything unless we use some kind of lock/sync mechanism that is not built in (which we don't, AFAIK).

To me it seems that if there is a race between steps 1 and 4, it could well be a real problem. If the thread is somehow still reading its private data while the main thread frees it, I don't see how that would be expected. For now I will suppress this, but it does not sound like expected behavior to me (though I am not an expert on our NSPR thread implementation).
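For reference, the interim suppression is just an entry in a TSan suppressions file passed at runtime via TSAN_OPTIONS=suppressions=<file>; the frame name below is only an illustration of what such an entry could look like, not necessarily the exact one we use:

  # Illustrative TSan suppression entry -- matches any reported race whose
  # stack contains the named frame (the frame chosen here is an assumption).
  race:PR_Free

The downside, mentioned later in this bug, is that matching such a frequent report against the suppression list over and over is itself costly.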

To me it seems that if there is a race between steps 1 and 4, it could well be a real problem. If the thread is somehow still reading its private data while the main thread frees it, I don't see how that would be expected.

Agreed for sure. But does such a race exist in this particular case? It seems like the main thread is known not to reach free() until after the other thread finishes step 1. But this is my first time looking at this code so I could be misunderstanding it.

(In reply to :dmajor from comment #4)

Agreed for sure. But does such a race exist in this particular case? It seems like the main thread is known not to reach free() until after the other thread finishes step 1. But this is my first time looking at this code so I could be misunderstanding it.

If TSan reports such a race, then from my perspective there are only two options:

  1. There really is such a race.

or

  2. There is a locking mechanism, invisible to TSan, that synchronizes this.

So if the main thread is "known not to reach free() until after the other thread finishes step 1", I would naively ask "how?". If a standard synchronization primitive is doing this, then TSan would know about it. If the synchronization is a custom mechanism we implemented ourselves (e.g. using inline assembly), or it happens through atomics/primitives in an uninstrumented part of the code, then that would explain the report and it might be a false positive.
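For completeness, if we did have such an uninstrumented or hand-rolled synchronization mechanism, the usual way to teach TSan about it would be the annotation interface in <sanitizer/tsan_interface.h>. A minimal sketch, where custom_signal()/custom_wait() are hypothetical stand-ins for a mechanism TSan cannot see:

  /* Hypothetical sketch: custom_signal()/custom_wait() stand in for a
     synchronization mechanism TSan cannot see (e.g. inline assembly or
     code in an uninstrumented library). The __tsan_* calls only resolve
     when linking the TSan runtime (-fsanitize=thread). */
  #include <sanitizer/tsan_interface.h>

  extern void custom_signal(void);   /* hypothetical: wakes the waiter */
  extern void custom_wait(void);     /* hypothetical: blocks until signaled */

  static int shared_data;

  void producer(void) {
      shared_data = 42;
      __tsan_release(&shared_data);  /* everything before this point ... */
      custom_signal();
  }

  void consumer(void) {
      custom_wait();
      __tsan_acquire(&shared_data);  /* ... happens-before everything after this */
      int observed = shared_data;    /* TSan now considers this access ordered */
      (void)observed;
  }

That said, as noted above, none of this should be necessary for plain pthread_join.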

So if the main thread is "known not to reach free() until after the other thread finishes step 1", I would naively ask "how?". If a standard synchronization primitive is doing this, then TSan would know about it.

This is the mechanism:

  2. The main thread is waiting in pthread_join for T54 to exit

The main thread can't reach free() until it returns from pthread_join, which can't happen until T54 exits, which can't happen until _PR_DestroyThreadPrivate returns to _pt_root.

I don't know if that counts as a synchronization primitive.

This is a false positive from TSan.

The report says there's a read when we're destroying NSPR's thread-specific data on a particular thread, just before we exit the NSPR top-level thread routine. It then complains that we're writing to that location while joining the thread.

The sequence of events looks something like:

Main thread (MT): requests join of thread T1 through PR_JoinThread.
MT: waits somewhere in libpthread or similar.
T1: Finishes internal nsThread routine.
T1: returns to https://searchfox.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#201
T1: bookkeeping to clean up: https://searchfox.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#203-235
T1: destroys internal NSPR thread stuff: https://searchfox.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#248

We are now at the point where TSan records the initial read of the memory location.

T1: finishes destroying, exits _pt_root, returns back to whatever libpthread bits started the thread in the first place.
T1: signals any waiters that the thread has exited.

At this point, T1 is finished. No further memory activity can come from T1.

MT: gets woken up by said signal from the previous step.
MT: resumes: https://searchfox.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#587
MT: calls _pt_thread_death_internal: https://searchfox.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#590-594

Note that we're not running the thread-local data destructor(s) here, as per the comment in the code: that comment references the point where TSan found the initial memory write.

MT: we free the underlying thread structure: https://searchfox.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#912
MT: we call free(3) on the pointer: https://searchfox.org/mozilla-central/source/nsprpub/pr/src/malloc/prmem.c#476
MT: we're writing to the pointer (?) in TSan's hooks (?) Or is this just our scribbling over freed memory in mozjemalloc or something?

So TSan is complaining that we're writing to memory that T1 read from, and that write could have been important to T1, so we should have synchronized. But we did synchronize: the main thread can't get to the point where it would be writing until T1 told it to proceed. TSan even intercepts pthread_join and pthread_create, so it should have some inkling of what's going on here.

The priority flag is not set for this bug.
:KaiE, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(kaie)

(In reply to Nathan Froyd [:froydnj] from comment #7)

This is a false positive from TSan.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(kaie)
Priority: -- → P2
Resolution: --- → INVALID

(In reply to Kai Engert (:KaiE:) from comment #9)

(In reply to Nathan Froyd [:froydnj] from comment #7)

This is a false positive from TSan.

It would be good to leave this open because we are still trying to mitigate this somehow.

Our current status is that this is due to how NSPR is built; it can perhaps be fixed in TSan, but if not, we might have to apply a custom fix in NSPR. It is not clear yet how we can easily resolve this, but suppressing this race is very costly because it is so frequent.

I'm currently planning to test with a patched Clang to see if dvyukov's suggested change in TSan fixes the problem.

Assignee: nobody → choller
Status: RESOLVED → REOPENED
Priority: P2 → P5
Resolution: INVALID → ---

Just a quick status update: I am still in touch with the TSan developers to figure out if there is anything we can do to make this easily work.

If that doesn't work out, I just had another idea:

We would try to compile-time blacklist _pt_root and _PR_DestroyThreadPrivate by annotating them in the code, as sketched below. When building with TSan, we would also have to replace the memset call in _PR_DestroyThreadPrivate with an inlined version, because the libc memset is otherwise intercepted by TSan.
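A sketch of what the source annotation could look like, assuming a Clang-style no_sanitize attribute wrapped in a hypothetical macro (the macro name and its wiring are assumptions, not an existing NSPR facility):

  /* Hypothetical macro (PR_NO_SANITIZE_THREAD is not an existing NSPR name)
     that would be applied to the definitions of _pt_root and
     _PR_DestroyThreadPrivate so TSan does not instrument their accesses. */
  #if defined(__has_feature)
  #  if __has_feature(thread_sanitizer)
  #    define PR_NO_SANITIZE_THREAD __attribute__((no_sanitize("thread")))
  #  endif
  #endif
  #ifndef PR_NO_SANITIZE_THREAD
  #  define PR_NO_SANITIZE_THREAD
  #endif

An alternative would be a Clang special-case list with fun:_pt_root and fun:_PR_DestroyThreadPrivate entries passed via -fsanitize-blacklist=, which would avoid touching the NSPR sources.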

The problem has been diagnosed by the TSan developers and a fix is available at https://reviews.llvm.org/D74828

In short, the problem was caused by our use of the called_from_lib suppression feature. Apparently, this feature caused TSan to disable some vital interceptors for pthread functions, which desynchronized TSan's internal thread state.
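For context, such a suppression is a single line in the suppressions file; the library name below is an assumption, shown only to illustrate the mechanism:

  # Illustrative called_from_lib entry (library name assumed). Matching a
  # library here tells TSan to ignore interceptor calls coming from it,
  # which before the D74828 fix also covered the pthread interceptors that
  # maintain TSan's happens-before state, as described above.
  called_from_lib:libnspr4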

I will try to backport the patch to our version of Clang 9 so we can close this bug. Getting rid of these races should give us a nice performance boost for TSan, because the suppression stats show that this issue is by far the most frequent one.

Pushed by choller@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/99f74aa3cae9 Import TSan fix D74828 from Clang upstream. r=froydnj
Status: REOPENED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED